rm(list=ls())
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
I’m going to work through the process of importing, cleaning, and transforming a dataset.
I’ll create a single script that will complete every stage. This means that I can repeat the process, or share it with others.
The script relies on the features of the dplyr library. I load this first, after cleaning my environment.
rm(list=ls())
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Then, I import my data.
url <- "https://www.dropbox.com/scl/fi/a9bad7bh74n9jazge29eg/netball_data_999.csv?rlkey=hattwdksw8ispfi6lymg0ox66&dl=1"
data <- read.csv(url)
rm(url)This code snippet contains three separate instructions.
First, I create an object in my environment called [url]. This tells R where my datafile is stored.
Then, I use the read.csv function to pull the file located at [url] into my environment. Notice that I do this by creating a new object called [data], which is the output of the function read.csv.
Finally, I remove the object [url] from the environment by calling the rm function.
Now, I inspect the data that has been pulled from dropbox. I use the head, str and summary functions to get a sense of the data.
head(data) # the first six rows ('observations') player_id position goals_scored assists turnovers
1 1 Goal Shooter 25 8 3
2 2 Goal Attack 18 12 999
3 3 Wing Attack 10 999 1
4 4 Center 999 22 5
5 5 Wing Defense 5 10 999
6 6 Goal Defense 12 999 2
str(data) # the variable types that R has ASSUMED'data.frame': 10 obs. of 5 variables:
$ player_id : int 1 2 3 4 5 6 7 8 9 10
$ position : chr "Goal Shooter" "Goal Attack" "Wing Attack" "Center" ...
$ goals_scored: int 25 18 10 999 5 12 999 30 15 7
$ assists : int 8 12 999 22 10 999 6 14 12 5
$ turnovers : int 3 999 1 5 999 2 3 4 999 1
summary(data) # descriptive statistics for each of the variables player_id position goals_scored assists
Min. : 1.00 Length:10 Min. : 5.00 Min. : 5.0
1st Qu.: 3.25 Class :character 1st Qu.: 10.50 1st Qu.: 8.5
Median : 5.50 Mode :character Median : 16.50 Median : 12.0
Mean : 5.50 Mean :212.00 Mean :208.7
3rd Qu.: 7.75 3rd Qu.: 28.75 3rd Qu.: 20.0
Max. :10.00 Max. :999.00 Max. :999.0
turnovers
Min. : 1.00
1st Qu.: 2.25
Median : 3.50
Mean :301.60
3rd Qu.:750.50
Max. :999.00
Data is often ‘messy’. We’ll cover how to deal with this in more detail in the next few weeks, but for now, assume that we want to represent missing data with the value NA.
In this dataset, ‘999’ represents an missing value.
First, we’ll replace these values with ‘NA’.
Note that R will do this automatically if the element is truly missing.
# Replace 999 with NA in the dataset
data[data == 999] <- NAThe simplest approach to dealing with missing data is to remove any observations with missing data (‘NA’) in any of the columns. We can use the na.omit function to do this.
# Remove observations with missing data
data_clean <- na.omit(data)Notice that, following good practice, I haven’t overwritten the original data frame. Rather, I’ve created a new data frame that is a ‘clean’ version of the old one.
Datasets often contain variables that we don’t need. To keep things tidy, I usually delete anything I won’t be using. The advantage of doing this in a script is that, if I change my mind later, I can just edit the script to retain the variable or variables.
# Remove variable [turnovers]
data_clean$turnovers <- NULL
# I'm going to reintroduce [turnovers]
turnovers_temp <- na.omit(data$turnovers)
extracted_players <- inner_join(data, data_clean, by = "player_id")
extracted_players$position.y <- NULL
extracted_players$goals_scored.y <- NULL
extracted_players$assists.y <- NULL
data_clean <- extracted_players
rm(extracted_players)Often, we want to rename variables, especially if the dataset has been provided by an external organisation.
In the following example, I change the name of the variable [position] to [player_position], [assists] to [player_assists] and [turnovers] to [player_turnovers]:
# Rename variables
names(data_clean) <- c("player_id", "player_position", "goals_scored", "player_assists", "player_turnovers")Last week, we talked about the importance of making sure that R understands the types of variables in our dataset.
In particular, anything that is going to be a grouping variable needs to be identified as a factor.
In our example, [player_position] is a factor, as we can use it to group players into different categories.
# Change variable types
data_clean$player_position <- as.factor(data_clean$player_position)Then, I can use the str function to check variable types. Notice that R has identified that there are two levels of the factor [player_position].
# Use the 'str' function to check variable types
str(data_clean)'data.frame': 3 obs. of 5 variables:
$ player_id : int 1 8 10
$ player_position : Factor w/ 2 levels "Goal Shooter",..: 1 1 2
$ goals_scored : int 25 30 7
$ player_assists : int 8 14 5
$ player_turnovers: int 3 4 1
I’ll now move to creating a new variable based on something done to existing variables.
Calculate a new variable, [data_clean$ratio], which is [goals_scored]/[player_assists].
# Calculate 'ratio' as [goals_scored] divided by [player_assists]
data_clean$ratio <- data_clean$goals_scored/data_clean$player_assistsI might want to round this variable:
data_clean$ratio <- round(data_clean$ratio,2) # round the new variable to 2 decimal places
head(data_clean) player_id player_position goals_scored player_assists player_turnovers ratio
1 1 Goal Shooter 25 8 3 3.12
2 8 Goal Shooter 30 14 4 2.14
3 10 Wing Defense 7 5 1 1.40
Finally, I’ll Write the dataset as a .csv file to an appropriate location.
write.csv(data_clean,"my_netball_data.csv")Now, create a script that conducts all of the following steps.
Begin by including the following snippet, which directs R to the file location:
url <- "https://www.dropbox.com/scl/fi/w6cmx5fgl6y5e1sizeskf/t08_data_b1700_01.csv?rlkey=trab5xa3hpqhf01ruhx1w2ers&dl=1"
data <- read.csv(url)
rm(url)Working on the dataframe ‘data’, and with the dplyr library installed and loaded:
head, str and summary commands to overview the dataset.Use the head, str and summary commands to overview the dataset.
# Overview the dataset
head(data)
str(data)
summary(data)In this dataset, ‘999’ represents an outlier. Replace these values with ‘NA’.
# Replace 999 with NA in the dataset
data[data == 999] <- NARemove any observations with missing data (‘NA’).
# Remove observations with missing data
data_clean <- na.omit(data)Remove the variable [X] from the dataset.
# Remove variable 'X'
data_clean$X <- NULLRename each of the variables to the format [var_id], [var_a] etc.
# Rename variables
names(data_clean) <- c("var_id", "var_a", "var_b", "var_c", "var_d")Change the variable types as follows: [var_id] = factor, [var_a] = factor.
# Change variable types
data_clean$var_id <- as.factor(data_clean$var_id)
data_clean$var_a <- as.factor(data_clean$var_a)Use the str function to check variable types.
# Use the 'str' function to check variable types
str(data_clean)Calculate a new variable, [var_e], which is the sum of [var_c] + [var_d].
# Calculate 'var_e' as the sum of 'var_c' + 'var_d'
data_clean$var_e <- data_clean$var_c + data_clean$var_dFirst, I load the tidyverse library and clear my environment.
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ ggplot2 3.5.1 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
rm(list=ls())Then, I create a simple dataframe that contains the statistics of a hypothetical football player:
football_df <- data.frame(
player_name = c("John", "Mike", "Lucas", "Eva"),
goals_scored = c(5, 10, 3, 6),
assists = c(4, 6, 2, 3)
)
print(football_df) player_name goals_scored assists
1 John 5 4
2 Mike 10 6
3 Lucas 3 2
4 Eva 6 3
Convert the dataframe to a tibble and print it.
football_tibble <- as_tibble(football_df)
print(football_tibble)# A tibble: 4 × 3
player_name goals_scored assists
<chr> <dbl> <dbl>
1 John 5 4
2 Mike 10 6
3 Lucas 3 2
4 Eva 6 3
I can add a new player’s statistics to the dataframe and a new column for [games_played] using the rbind function.
football_df <- rbind(football_df, data.frame(player_name = "Sophia", goals_scored = 7, assists = 5))I can add a new variable for all players (observations) called [games_played]
football_df$games_played <- c(10, 12, 9, 11, 10)
print(football_df) player_name goals_scored assists games_played
1 John 5 4 10
2 Mike 10 6 12
3 Lucas 3 2 9
4 Eva 6 3 11
5 Sophia 7 5 10
Using dplyr, filter out players who’ve scored more than 5 goals.
top_scorers <- football_df %>% filter(goals_scored > 5)
print(top_scorers) player_name goals_scored assists games_played
1 Mike 10 6 12
2 Eva 6 3 11
3 Sophia 7 5 10
Calculate a new column [goal_per_game] and get the average goals scored by the team.
football_df <- football_df %>%
mutate(goal_per_game = goals_scored / games_played)
avg_goals <- football_df %>%
summarise(mean_goals = mean(goals_scored))
print(football_df) player_name goals_scored assists games_played goal_per_game
1 John 5 4 10 0.5000000
2 Mike 10 6 12 0.8333333
3 Lucas 3 2 9 0.3333333
4 Eva 6 3 11 0.5454545
5 Sophia 7 5 10 0.7000000
print(avg_goals) mean_goals
1 6.2
Arrange players by goals scored in descending order.
sorted_df <- football_df %>%
arrange(desc(goals_scored))
print(sorted_df) player_name goals_scored assists games_played goal_per_game
1 Mike 10 6 12 0.8333333
2 Sophia 7 5 10 0.7000000
3 Eva 6 3 11 0.5454545
4 John 5 4 10 0.5000000
5 Lucas 3 2 9 0.3333333
Group players by position and get the total goals scored for each position.
football_df$position <- c("Forward", "Midfielder", "Defender", "Forward", "Midfielder")
position_goals <- football_df %>%
group_by(position) %>%
summarise(total_goals = sum(goals_scored))
print(position_goals)# A tibble: 3 × 2
position total_goals
<chr> <dbl>
1 Defender 3
2 Forward 11
3 Midfielder 17
The following activity allows you to practise some of the techniques covered above.
You may also have to do some research to find out how to complete some of the challenges!
Install and load necessary packages (e.g. tidyverse)
Download the data from a url, and create a dataframe called [data].
https://www.dropbox.com/scl/fi/ank5v48ndxqggw6mqx6dh/football_data_999_with_position.csv?rlkey=2zqwk0tfxkq3mc6ikvxoyu8r6&dl=1
Convert 999 values (missing data) to NA.
Create a new dataframe called [data_clean] by removing observations with missing data.
Convert the above dataframe to a tibble called [football_tibble] and print it.
Delete all objects apart from [football_tibble].
Add a new player’s statistics to [football_tibble].
Using dplyr, filter out players who’ve scored fewwer than 16 goals.
Calculate a new variable [goal_per_game] and get the average goals scored by the team.
Arrange players by goals scored in descending order.
Finally, group players by position and get the total goals scored for each position.
rm(list=ls())
library(tidyverse)url <- "https://www.dropbox.com/scl/fi/ank5v48ndxqggw6mqx6dh/football_data_999_with_position.csv?rlkey=2zqwk0tfxkq3mc6ikvxoyu8r6&dl=1"
data <- read.csv(url)
rm(url)# Replace 999 with NA in the dataset
data[data == 999] <- NA
# Remove observations with missing data
data_clean <- na.omit(data)football_tibble <- as_tibble(data_clean)
print(football_tibble)rm(data, data_clean)football_tibble <- rbind(football_tibble, data.frame(player_name = "Sophia", goals_scored = 7, assists = 5, player_position = "Forward"))
print(football_tibble)dplyr, filter out players who’ve scored fewer than 16 goals.top_scorers <- football_tibble %>% filter(goals_scored > 15)
print(top_scorers)Calculate a new column [goals_and_assists] which is the sum of [goals_scored] and [assists], and get the average goals scored by the team.
football_tibble <- football_tibble %>%
mutate(goal_per_game = goals_scored + assists)
avg_goals <- football_tibble %>%
summarise(mean_goals = mean(goals_scored))
print(football_tibble)
print(avg_goals)sorted_df <- football_tibble %>%
arrange(desc(goals_scored))
print(sorted_df)position_goals <- football_tibble %>%
group_by(player_position) %>%
summarise(total_goals = sum(goals_scored))
print(position_goals)